Lecture’s plan

  1. Feed-forward neural networks
  2. Recurrent neural networks
    1. SRN
    2. LSTM
    3. Bi-LSTM
    4. GRU

What is Deep Learning (DL) ?

A subfield of machine learning concerned with learning representations of data. Exceptionally effective at learning patterns.

Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.

If you provide the system with tons of information, it begins to understand it and respond in useful ways.

Feed-forward neural networks

  • Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was long thought difficult to find.
  • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward.

  • The weights determine the function computed. Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer.

Introduction

\[h = \sigma(W_1x + b_1)\] \[y = \sigma(W_2h + b_2)\]

Introduction

One forward pass
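The forward pass equations above can be sketched in NumPy; the layer sizes and random weights here are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: 3 inputs, 4 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(x):
    """One forward pass: h = sigma(W1 x + b1), y = sigma(W2 h + b2)."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return h, y

h, y = forward(np.array([1.0, 0.5, -0.2]))
```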

Training

Gradient descent

  • Define objective to minimize error: \[E(W) = \sum_{d \in D} \sum_{k \in K} (t_{kd} - o_{kd})^2\]

    where \(D\) is the set of training examples, \(K\) is the set of output units, \(t_{kd}\) and \(o_{kd}\) are, respectively, the teacher and current output for unit \(k\) for example \(d\).
  • The derivative of a sigmoid unit with respect to net input is: \[\frac{\partial{o_j}}{\partial{net_j}} = o_j(1-o_j)\]
  • Learning rule: change weights so as to minimize the error \[\Delta w_{ji} = - \eta \frac{\partial{E}}{\partial{w_{ji}}}\]

Backpropagation learning rule

  • Each weight changed by: \[\Delta w_{ji} = \eta \delta_j o_i \] \[\delta_j = o_j(1 - o_j)(t_j - o_j) \ \ \ \text{if } j \text{ is an output unit}\] \[\delta_j = o_j(1 - o_j)\sum_k{\delta_k w_{kj}} \ \ \ \text{if } j \text{ is a hidden unit}\]

    where \(\eta\) is a constant called the learning rate

    \(t_j\) is the correct teacher output for unit \(j\)

    \(\delta_j\) is the error measure for unit \(j\)
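The backpropagation rule above can be sketched in NumPy for a single-hidden-layer sigmoid network (function and variable names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, eta=0.5):
    """One backprop update for a one-hidden-layer sigmoid network.
    Weight arrays are updated in place."""
    # Forward pass.
    h = sigmoid(W1 @ x + b1)          # hidden activations o_i
    o = sigmoid(W2 @ h + b2)          # output activations o_j
    # Output deltas: delta_j = o_j (1 - o_j)(t_j - o_j).
    delta_out = o * (1 - o) * (t - o)
    # Hidden deltas: delta_j = o_j (1 - o_j) sum_k delta_k w_kj.
    delta_hid = h * (1 - h) * (W2.T @ delta_out)
    # Weight changes: Delta w_ji = eta * delta_j * o_i.
    W2 += eta * np.outer(delta_out, h)
    b2 += eta * delta_out
    W1 += eta * np.outer(delta_hid, x)
    b1 += eta * delta_hid
    return o
```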

Updating weights

objective/cost function \(J(\theta)\)

Update each element of \(\theta\):

\[\theta^{new}_j = \theta^{old}_j - \alpha \frac{\partial}{\partial \theta^{old}_j} J(\theta)\]

Matrix notation for all parameters (\(\alpha\): learning rate):

\[\theta^{new} = \theta^{old} - \alpha \nabla_{\theta}J(\theta)\]
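A minimal sketch of the update rule, using \(J(\theta) = \|\theta\|^2\) as a toy objective whose gradient is \(2\theta\):

```python
import numpy as np

def gd_step(theta, grad, alpha=0.1):
    """theta_new = theta_old - alpha * grad of J at theta."""
    return theta - alpha * grad

# Minimize J(theta) = ||theta||^2; each step multiplies theta by (1 - 2*alpha).
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = gd_step(theta, 2 * theta)
```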

Recursively apply the chain rule through each node

Notes on training

  • Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely.
  • However, in practice, does converge to low error for many large networks on real data.
  • Many epochs (thousands) may be required, hours or days of training for large networks.
  • To avoid local-minima problems, run several trials starting with different random weights (random restarts).
    • Take results of trial with lowest training set error.
    • Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).

Hidden unit representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space.
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors, etc.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Overfitting

 

The learned hypothesis may fit the training data very well, even outliers (noise), but fail to generalize to new examples (test data)

Overfitting prevention

  • Running too many epochs can result in over-fitting.

  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data for validation:
    • Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    • Train final network on complete training set for this many epochs.

Regularization

Dropout
  • Randomly drop units (along with their connections) during training
  • Each unit retained with fixed probability \(p\), independent of other units
  • Hyper-parameter \(p\) to be chosen (tuned)
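Dropout can be sketched as a simple masking operation; this sketch uses the common "inverted dropout" convention of scaling retained units by \(1/p\) at training time, so no rescaling is needed at test time:

```python
import numpy as np

def dropout(h, p, rng, train=True):
    """Retain each unit with probability p, independently of the others.
    Retained units are scaled by 1/p (inverted dropout) so the expected
    activation is unchanged; at test time all units are used as-is."""
    if not train:
        return h
    mask = rng.random(h.shape) < p
    return h * mask / p
```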

L2 = weight decay
  • Regularization term that penalizes big weights, added to the objective \(J_{reg}(\theta) = J(\theta) + \lambda\sum_k{\theta_k^2}\)
  • Weight decay value determines how dominant regularization is during gradient computation
  • Big weight decay coefficient → big penalty for big weights
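A sketch of a gradient step with the L2 penalty folded in; the extra \(2\lambda\theta\) term comes from differentiating \(\lambda\sum_k{\theta_k^2}\) and shrinks the weights toward zero on every step (hence "weight decay"):

```python
import numpy as np

def sgd_weight_decay_step(theta, grad, alpha=0.1, lam=0.01):
    """Gradient step on J_reg(theta) = J(theta) + lam * sum_k theta_k^2.
    grad is the gradient of the unregularized objective J."""
    return theta - alpha * (grad + 2 * lam * theta)
```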
Early-stopping
  • Use validation error to decide when to stop training
  • Stop when monitored quantity has not improved after n subsequent epochs
  • \(n\) is called patience
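A minimal sketch of early stopping with patience, assuming validation error is recorded once per epoch:

```python
def early_stopping(val_errors, patience):
    """Return the epoch at which to stop: the first epoch at which the
    validation error has not improved for `patience` consecutive epochs,
    or the last epoch if that never happens."""
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_errors) - 1
```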

Tuning hyperparameters

Loss functions and output

Determining the best number of hidden units

  • Too few hidden units prevents the network from adequately fitting the data.
  • Too many hidden units can result in over-fitting.

  • Use internal cross-validation to empirically determine an optimal number of hidden units.

Recurrent Neural Networks

Recurrent Neural Network (RNN)

  • Add feedback loops where some units’ current outputs determine some future network inputs.
  • RNNs can model dynamic finite-state machines, beyond the static combinatorial circuits modeled by feed-forward networks.

Simple Recurrent Network (SRN)

  • Initially developed by Jeff Elman (“Finding structure in time,” 1990).
  • Additional input to hidden layer is the state of the hidden layer in the previous time step.
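The SRN hidden-state update can be sketched as follows (tanh is used here as the activation; Elman's original network used sigmoid units):

```python
import numpy as np

def srn_step(x, h_prev, W_xh, W_hh, b):
    """Elman SRN step: the new hidden state depends on the current
    input and on the hidden state from the previous time step."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)
```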

Unrolled RNN

  • Behavior of RNN is perhaps best viewed by “unrolling” the network over time.

Training RNNs

  • RNNs can be trained using “backpropagation through time.”
  • Can be viewed as applying normal backprop to the unrolled network.

LSTM

Vanishing gradient problem

  • Backpropagated errors multiply at each layer, resulting in exponential decay (if derivative is small) or growth (if derivative is large).
  • Makes it very difficult to train deep networks, or simple recurrent networks over many time steps.
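A toy illustration of the decay: the sigmoid derivative \(o(1-o)\) is at most 0.25, so backpropagating through many layers or time steps can shrink the gradient exponentially:

```python
# Backpropagating through T sigmoid units multiplies the gradient by a
# factor of at most 0.25 (the maximum of o(1-o)) per step, so even in
# this best case the gradient shrinks exponentially with depth.
grad = 1.0
for _ in range(20):
    grad *= 0.25
# After only 20 steps, grad = 0.25**20, vanishingly small.
```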

Suppose we had the following scenario:

Day 1: Lift Weights

Day 2: Swimming

Day 3: At this point, our model must decide whether we should take a rest day or do yoga. Unfortunately, it only has access to the previous day: it knows we swam yesterday, but it doesn’t know whether we had taken a break the day before. Therefore, it can end up predicting yoga.

  • LSTMs were invented to get around this problem.

https://towardsdatascience.com/

Long Short Term Memory

  • LSTM networks add additional gating units to each memory cell.
    • Forget gate
    • Input gate
    • Output gate
  • Prevents vanishing/exploding gradient problem and allows network to retain state information over longer periods of time.

LSTM network architecture

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Cell state

  • Maintains a vector \(C_t\) that is the same dimensionality as the hidden state, \(h_t\)
  • Information can be added or deleted from this state vector via the forget and input gates.

Forget gate

  • Forget gate computes a 0–1 value using a logistic sigmoid output function of the input, \(x_t\), and the previous hidden state, \(h_{t-1}\):
  • Multiplicatively combined with cell state, “forgetting” information where the gate outputs something close to 0.

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

Input gate

  • First, determine which entries in the cell state to update by computing 0-1 sigmoid output.
  • Then determine what amount to add/subtract from these entries by computing a tanh output (valued –1 to 1) function of the input and hidden state.

\[i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)\] \[\tilde{C}_t = tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]

Updating the cell state

  • Cell state is updated by using component-wise vector multiply to “forget” and vector addition to “input” new information.

\[C_t = f_t * C_{t-1} + i_t * \tilde{C}_t\]

Output gate

  • Hidden state is updated based on a “filtered” version of the cell state, scaled to –1 to 1 using tanh.
  • Output gate computes a sigmoid function of the input and current hidden state to determine which elements of the cell state to “output”.

\[o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)\] \[h_t = o_t * tanh(C_t)\]
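Putting the four equations together, one LSTM step can be sketched as follows (each weight matrix acts on the concatenation \([h_{t-1}, x_t]\); dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    """One LSTM step following the gate equations above."""
    z = np.concatenate([h_prev, x])       # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z + bf)              # forget gate
    i = sigmoid(Wi @ z + bi)              # input gate
    c_tilde = np.tanh(Wc @ z + bc)        # candidate values
    c = f * c_prev + i * c_tilde          # cell state update
    o = sigmoid(Wo @ z + bo)              # output gate
    h = o * np.tanh(c)                    # new hidden state
    return h, c
```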

LSTM training

  • Trainable with backprop derivatives such as:
    • Stochastic gradient descent (randomize order of examples in each epoch) with momentum (bias weight changes to continue in same direction as last update).
    • ADAM optimizer (Kingma & Ba, 2015)
  • Each cell has many parameters (\(W_f\), \(W_i\), \(W_C\), \(W_o\))
    • Generally requires lots of training data.
    • Requires lots of compute time, typically exploiting GPU clusters.

General problems solved with LSTM

  • Sequence labeling
    • Train with supervised output at each time step computed using a single or multilayer network that maps the hidden state (\(h_t\)) to an output vector (\(O_t\)).
  • Language modeling
    • Train to predict next input (\(O_t =I_{t+1}\))
  • Sequence (e.g. text) classification
    • Train a single or multilayer network that maps the final hidden state (\(h_n\)) to an output vector (\(O\)).

Sequence to sequence transduction (mapping)

  • Encoder/Decoder framework: one LSTM (the encoder) maps the input sequence to a “deep vector”; another LSTM (the decoder) maps this vector to an output sequence.

  • Train model “end to end” on I/O pairs of sequences.

Summary of LSTM application architectures

Successful applications of LSTM

  • Speech recognition: Language and acoustic modeling
  • Sequence labeling
  • Neural syntactic and semantic parsing
  • Image captioning: CNN output vector to sequence
  • Sequence to Sequence
    • Machine Translation (Sutskever, Vinyals, & Le, 2014)
    • Video Captioning (input sequence of CNN frame outputs)

Bi-directional LSTM (Bi-LSTM)

  • Separate LSTMs process sequence forward and backward and hidden layers at each time step are concatenated to form the cell output.
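The Bi-LSTM combination can be sketched abstractly over the per-step cell (in practice the step functions would be LSTM cells carrying a cell state as well; a simplified `(x, h) -> h` signature is used here):

```python
import numpy as np

def bilstm_outputs(xs, forward_step, backward_step, h0_f, h0_b):
    """Run one RNN left-to-right and another right-to-left over the
    sequence xs, then concatenate the two hidden states per time step."""
    hs_f, h = [], h0_f
    for x in xs:                          # forward pass over the sequence
        h = forward_step(x, h)
        hs_f.append(h)
    hs_b, h = [], h0_b
    for x in reversed(xs):                # backward pass over the sequence
        h = backward_step(x, h)
        hs_b.append(h)
    hs_b.reverse()                        # realign with forward time order
    return [np.concatenate([f, b]) for f, b in zip(hs_f, hs_b)]
```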

Gated Recurrent Unit (GRU)

  • Alternative RNN to LSTM that uses fewer gates (Cho, et al., 2014)
    • Combines forget and input gates into “update” gate.
    • Eliminates cell state vector

\[z_t = \sigma(W_z \cdot [h_{t-1}, x_t])\] \[r_t = \sigma(W_r \cdot [h_{t-1}, x_t])\] \[\tilde{h}_t = tanh(W \cdot [r_t * h_{t-1}, x_t])\] \[h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t\]
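The GRU equations can be sketched directly (bias terms omitted, as in the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, W):
    """One GRU step: update gate z, reset gate r, candidate state h~,
    then interpolate between the old state and the candidate."""
    hx = np.concatenate([h_prev, x])          # [h_{t-1}, x_t]
    z = sigmoid(Wz @ hx)                      # update gate
    r = sigmoid(Wr @ hx)                      # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))
    return (1 - z) * h_prev + z * h_tilde     # new hidden state
```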

GRU versus LSTM

  • GRU has significantly fewer parameters and trains faster.
  • Experimental results comparing the two are still inconclusive: on many problems they perform the same, but each has problems on which it works better.

Attention

  • For many applications, it helps to add “attention” to RNNs.
  • Allows network to learn to attend to different parts of the input at different time steps, shifting its attention to focus on different aspects during its processing.
  • Used in image captioning to focus on different parts of an image when generating different parts of the output sentence.
  • In MT, allows focusing attention on different parts of the source sentence when generating different parts of the translation.

Summary

  • Feed-forward neural networks
  • Backpropagation
  • Recurrent neural networks
  • SRN
  • LSTM
  • Bi-LSTM
  • GRU

Time for Practical 6!